The Analytics Revolution by Franks Bill

The Analytics Revolution by Franks Bill

Author:Franks, Bill
Language: eng
Format: epub
ISBN: 9781118976760
Publisher: Wiley
Published: 2014-09-12T00:00:00+00:00


Nonrelational Pillar

There is a wide variety of nonrelational platforms available. Hadoop has risen rapidly to become the most popular of the nonrelational platforms and a common component of analytical environments. Nonrelational platforms do not require data to be stored in any specific format and use a variety of programming languages in addition to some basic SQL to interface with the data. Hadoop has gained popularity due to its ability to deal with the unstructured or semistructured data that has become so common in the world of big data. In reality, all data has some structure. However, unstructured data is usually defined as data formatted in a complex way that's not easily converted into an analysis-ready form. Some examples include text, video, and audio files. Another common type of data is semistructured data, which falls in the middle between structured and unstructured data. Examples include many log files like web logs, sensor data, or the JSON data discussed earlier in this chapter. Semistructured data has defined data points, but not necessarily in any consistent order or simple format.

Hadoop handles these types of data particularly well for reasons that are discussed shortly. The fact that Hadoop is open source, and therefore has no license fee, also makes it easy to experiment with at low cost. In addition, there are commercial versions of Hadoop available from vendors such as Cloudera, Hortonworks, and MapR, as well as Hadoop appliances available from vendors such as Teradata, IBM, and Oracle. All of these offerings add value-added features on top of the base open source code.

Hadoop is different from relational technology in some important ways, led by the fact that it requires only that data files be placed on a file system. No specific format or structure of data is required for loading into Hadoop. Since Hadoop doesn't assume anything about the data files it stores, it also doesn't have any special handling for one type of a file over another.

The lack of a required format means it is possible to load text, photos, video, images, log data, sensor data, or any other type of data exactly as it comes in and then process it in parallel. This is in contrast to relational technology, where a row and column structure is assumed by default. While data with a relational structure can be placed in Hadoop, that isn't the sweet spot where Hadoop is differentiated. In fact, Hadoop is both more difficult to work with and slower to execute as compared to enterprise-class relational technology when standard relational operations are desired. This is because databases have all sorts of tools and tricks for dealing with relational data whereas Hadoop does not. Hadoop offers more flexibility with respect to data format, but the specialized functionality to deal with one format as opposed to another is lost.

One of the drivers to use Hadoop is the fact that some data is inherently more valuable than other data. For example, checking account transactions reflect money changing hands whereas a Twitter tweet is merely an opinion.



Download



Copyright Disclaimer:
This site does not store any files on its server. We only index and link to content provided by other sites. Please contact the content providers to delete copyright contents if any and email us, we'll remove relevant links or contents immediately.